Most databases offer a choice of several transaction isolation levels, offering a tradeoff between correctness and performance. However, that performance comes at a price, as developers must study their transactional interactions carefully or risk introducing subtle bugs. CockroachDB provides strong (“SERIALIZABLE
”) isolation by default to ensure that your application always sees the data it expects. In this post I'll explain what this means and how insufficient isolation impacts real-world applications.
The SQL standard defines four isolation levels:
SERIALIZABLE
REPEATABLE READ
READ COMMITTED
READ UNCOMMITTED
SERIALIZABLE
transactions run as if only one transaction were running at a time; the other isolation levels allow what the SQL standard euphemistically calls "the three phenomena": dirty reads, non-repeatable reads, and phantom reads. Subsequent research has identified additional "phenomena" and isolation levels.
In modern research, these "phenomena" are more commonly called "anomalies", or more bluntly, "lies". When you use a non-SERIALIZABLE
isolation level, you're giving the database permission to return an incorrect answer in the hope that it will be faster than producing the correct one. The SQL standard recognizes that this is dangerous and requires that SERIALIZABLE
is the default isolation level. Weaker isolation levels are provided as a potential optimization for applications that can tolerate these anomalies.
Most databases ignore the specification that SERIALIZABLE
be the default, and instead prioritize performance over safety by defaulting to the weaker READ COMMITTED
or REPEATABLE READ
isolation levels. More worryingly, some databases (including Oracle, and PostgreSQL prior to version 9.1) do not provide a serializable transaction implementation at all. Oracle's implementation of the SERIALIZABLE
isolation level is actually a weaker mode called "snapshot isolation".
Snapshot isolation was developed after the initial standardization of the SQL language, but has been implemented in multiple database systems because it provides a good balance of performance and consistency. It is stronger than READ COMMITTED
but weaker than SERIALIZABLE
. It is similar to REPEATABLE READ
but not exactly equivalent (REPEATABLE READ
permits phantom reads but prevents write skew, while the reverse is true of snapshot isolation). The databases that have implemented snapshot isolation have made different decisions about how to fit it into the four SQL standard levels. Oracle takes the most aggressive stance, calling their snapshot implementation SERIALIZABLE
. CockroachDB and Microsoft SQL Server are conservative and treat SNAPSHOT
as a separate fifth isolation level. PostgreSQL (since version 9.1) falls in between, using snapshot isolation in place of REPEATABLE READ
.
Because serializable mode is used less often in databases that default to weaker isolation, it is often less thoroughly tested or optimized. For example, PostgreSQL has a fixed-size memory pool that it uses to track conflicts between serializable transactions, which can be exhausted under heavy load.
Most database vendors treat stronger transaction isolation as an exotic option to be enabled by applications with exceptional consistency needs. Most applications, however, are expected to work with the faster but unsafe weak isolation modes. This backwards approach to the problem exposes applications to a variety of subtle bugs. At Cockroach Labs, we like thinking about transactional anomalies so much that we named all our conference rooms after them, but I would have a hard time advising with confidence when it is both safe and beneficial to choose SNAPSHOT
isolation instead of SERIALIZABLE
. Our philosophy is that it's better to start with safety and work towards performance than the other way around.
Recent research at Stanford has explored the degree to which weak isolation leads to real-world bugs. Todd Warszawski and Peter Bailis examined 12 eCommerce applications and found 22 bugs related to transactions, five of which would have been avoided by running at a higher isolation level. Many of these bugs were simple to exploit and had direct financial implications. For example, in five of the tested applications, adding an item to your cart while checking out in another browser tab could result in the item being added to the order for free. The researchers developed tools to identify these vulnerabilities in a semi-automated way, paving the way for similar attacks (which the researchers dubbed "ACIDRain") to become more prevalent.
Most databases that default to weak transactional isolation provide workarounds, such as the (non-standard) FOR UPDATE
and LOCK IN SHARE MODE
modifiers for SELECT
statements. When used correctly, these modifiers can make transactions safe even in weaker isolation levels. However, this is easy to get wrong, and even when used consistently these extensions introduce most of the downsides of SERIALIZABLE
mode (in fact, overuse of SELECT FOR UPDATE
in a READ COMMITTED
transaction can perform worse than a SERIALIZABLE
transaction, because it uses exclusive locks where serializability may only require shared locks). The ACIDRain research demonstrates the limitations of this technique: only one in three of the applications that attempted to use SELECT FOR UPDATE
feature did so correctly; the others remained vulnerable.
Databases that encourage the use of weaker isolation levels have prioritized performance over the safety of your data, leaving you to study subtle interactions between your transactions and implement error-prone workarounds. CockroachDB provides SERIALIZABLE
transactions by default to ensure that you always see the consistency that you expect from a transactional database.
Are distributed transactions your jam? Our engineering team is hiring! Check out our open positions here.
Illustration by Lisk Feng
Editor's Note: This post was originally authored when CockroachDB was pre-1.0. CockroachDB's architecture has undergone …
Read more
Almost a year ago, we wrote about our use of Jepsen in testing CockroachDB. As we prepare for CockroachDB 1.0, we …
Read more
In August, we published a blog post entitled “Why Can’t I Run a 100-Node CockroachDB Cluster? …
Read more